What and why?
One-parameter models
Normal model
Markov chain Monte Carlo
I have 50% probability of getting a head when flipping a coin
An Olympic shooter has 95% probability of hitting the target
Weather forecast indicates 80% probability of raining tomorrow
A cancer patient was told that he had 60% chance of surviving for at least 5 years
However, repeated experiments are not possible in many situations
Different people have different probabilities regarding the same event
The same person’s subjective probability is likely to change as more information becomes available
Here are several statements about John. After reading them, what is your belief regarding whether he is a thief?
There are two quantities of central interest
Bayesian inference begins with a formulation of joint beliefs about \(\theta\) and \(Y\)
Once the dataset \(Y\) becomes available, the last step is to update our belief about \(\theta\)
\[p(\theta|y) = \frac{p(y|\theta) p(\theta)}{\int p(y|\tilde\theta )p(\tilde\theta) \,\mathrm{d}\tilde\theta}\]
Suppose half of the sampled individuals are infected, i.e. \(Y=10\)
\[ \begin{align*} C^{-1} = & \int_0^1 \theta^{a+Y-1}(1-\theta)^{b+n-Y-1}\,\mathrm{d}\theta \\ = & B(a+Y, b+n-Y) \end{align*} \]
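As a quick numerical sanity check of this identity (not in the original derivation), the posterior density normalized by \(B(a+Y, b+n-Y)\) should integrate to 1:

```r
## Check numerically that B(a + Y, b + n - Y) is the right
## normalizing constant for the beta posterior
a <- 5; b <- 45; n <- 20; Y <- 10
post <- function(theta)
  theta^(a + Y - 1) * (1 - theta)^(b + n - Y - 1) / beta(a + Y, b + n - Y)
integrate(post, 0, 1)$value  ## close to 1
```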
\[ \begin{align*} & P(\tilde Y = k| Y) \\= & \int P(\tilde Y=k|\theta, Y)p(\theta|Y)\,\mathrm{d}\theta \\ = & \int_0^1 {n\choose k} \theta^k (1-\theta)^{n-k}\cdot \frac{\theta^{a+Y-1}(1-\theta)^{b+n-Y-1}}{B(a+Y, b+n-Y)} \,\mathrm{d}\theta \\ =& {n\choose k}\frac{B(a+Y+k, b+2n-Y-k)}{B(a+Y, b+n-Y)} \end{align*} \]
pred_dist <- function(k, Y, a, b, n) {
choose(n, k) * beta(a + Y + k, b + 2 * n - Y - k)/beta(a + Y, b + n - Y)
}
a <- 5
b <- 45
n <- 20
Y <- 10
pred_dist(seq(0, n), Y, a, b, n)
## [1] 1.485e-02 6.022e-02 1.254e-01 1.776e-01 1.914e-01 1.662e-01 1.205e-01
## [8] 7.440e-02 3.970e-02 1.845e-02 7.492e-03 2.661e-03 8.235e-04 2.207e-04
## [15] 5.065e-05 9.792e-06 1.556e-06 1.957e-07 1.831e-08 1.136e-09 3.511e-11
Use \([\theta_{\alpha/2}, \theta_{1-\alpha/2}]\), where \(\theta_{\alpha/2}\) and \(\theta_{1-\alpha/2}\) are posterior quantiles \[P(\theta < \theta_{\alpha/2}|Y=y) = P(\theta > \theta_{1-\alpha/2}|Y=y) =\alpha/2\]
Easy to implement
a <- 5; b <- 45; n <- 20; Y <- 10;
qbeta(c(0.025, 0.975), a + Y, b + n - Y)
## [1] 0.1271 0.3169
If the posterior distribution is not symmetric, a quantile-based interval excludes some points that have higher density than some points inside the interval
May not be an interval if the posterior density is multimodal, i.e. has multiple peaks
a <- 5; b <- 45; n <- 20; Y <- 10;
require(hdrcde)
as.vector(hdr(rbeta(1e5, a + Y, b + n - Y), prob=95)$hdr)
## [1] 0.1213 0.3098
\[ \begin{align*} E[\theta|Y_1,\ldots,Y_n] =& \frac{a+\sum_{i=1}^nY_i}{b+n} \\ =& \frac{n}{b+n}\cdot \hat\theta + \frac{b}{b+n}\cdot \theta_0 \end{align*} \]
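To see the shrinkage interpretation numerically, here is a small check using the same values as the nearby code chunk, assuming \(Y = 5\) is a single Poisson observation (so \(n = 1\)):

```r
## Posterior mean as a weighted average of sample mean and prior mean
## (a, b, Y follow the nearby example; n = 1 is an assumption)
a <- 2; b <- 1; Y <- 5; n <- 1
theta_hat <- sum(Y)/n                      ## sample mean
theta_0 <- a/b                             ## prior mean
post_mean <- (a + sum(Y))/(b + n)
weighted <- n/(b + n) * theta_hat + b/(b + n) * theta_0
c(post_mean, weighted)                     ## both equal 3.5
```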
a <- 2; b <- 1; Y <- 5;
dnbinom(seq(0, 10), a + Y, 1 / (b + 2))
## [1] 0.0004572 0.0021338 0.0056902 0.0113804 0.0189673 0.0278187 0.0370916
## [8] 0.0459229 0.0535768 0.0595297 0.0634984
\[p(y|\mu, \sigma^2) =\frac{1}{\sqrt{2\pi\sigma^2}}\textstyle\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right),\quad -\infty < y < \infty\]
the distribution is symmetric about \(\mu\), and the mode, median and mean are all equal to \(\mu\)
about 95% of the population lies within two standard deviations of the mean (more precisely, 1.96 standard deviations)
In general, given i.i.d. normal data \(Y_1,\ldots,Y_n\), if we select a normal prior on \(\mu\), then \(p(\mu|Y_1,\ldots,Y_n,\sigma^2)\) is normal with mean \(\mu_n\) and variance \(\tau_n^2\), where \[\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0+\frac{n}{\sigma^2}\bar{Y}}{\frac{1}{\tau_0^2}+\frac{n}{\sigma^2}}\quad\mbox{and}\quad \tau_n^2=\frac{1}{\frac{1}{\tau_0^2}+\frac{n}{\sigma^2}}\]
Posterior mean is a weighted average of the prior mean and the sample mean \[\mu_n = \frac{\tau_0^{-2}}{\tau_0^{-2}+n\sigma^{-2}}\mu_0 + \frac{n\sigma^{-2}}{\tau_0^{-2}+n\sigma^{-2}}\bar{Y}\]
## [1] 1.82 2.30 1.64 1.52 1.72 1.36 1.74 2.08 1.97
revenue <- c(1.82, 2.3, 1.64, 1.52, 1.72, 1.36, 1.74, 2.08, 1.97)
n <- length(revenue)
Y_bar <- mean(revenue)
s2 <- var(revenue)
mu_0 <- 2
tau_0 <- 0.6
a <- 1/tau_0^2 + n/s2
b <- mu_0/tau_0^2 + n/s2 * Y_bar
tau_n <- sqrt(1/a)
mu_n <- b/a
\[p(\mu,\sigma^2|Y_1,\ldots,Y_n) \propto p(Y_1,\ldots,Y_n| \mu,\sigma^2)p(\mu,\sigma^2)\]
We have seen that if \(\sigma^2\) were known, then a conjugate prior on \(\mu\) is normal so we set \(p(\mu|\sigma^2)\) as normal
We need a distribution supported on \((0, \infty)\) for \(\sigma^2\), and it turns out that the Gamma distribution is a crucial building block
The joint prior is chosen as follows \[ \begin{align*} \mu|\sigma^2 & \sim \mathrm{normal}\textstyle\left(\mu_0, \frac{\sigma^2}{\kappa_0}\right)\\ \sigma^{-2} & \sim \mathrm{gamma}\textstyle\left(\frac{\nu_0}{2}, \frac{\nu_0\sigma_0^2}{2}\right) \end{align*} \]
\(\mu_0\) and \(\kappa_0\) can be interpreted as the mean and sample size from a set of prior observations
Then using Bayes’ rule, we can show that (details are left for exercise) \[\{\mu|Y_1,\ldots,Y_n, \sigma^2\} \sim \mathrm{normal}(\mu_n, \sigma^2/\kappa_n),\] where \[\kappa_n = \kappa_0+n\quad\mbox{and}\quad \mu_n=\frac{\kappa_0\mu_0+n\bar{Y}}{\kappa_n},\] and \[\{\sigma^{-2}|Y_1,\ldots,Y_n\} \sim \mathrm{gamma}(\nu_n/2,\nu_n\sigma_n^2/2), \] where \[ \begin{align*} \nu_n &=\nu_0 + n \\ \sigma_n^2 &= \frac{1}{\nu_n}[\nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0n}{\kappa_n}(\bar{Y}-\mu_0)^2] \end{align*} \] and \(s^2\) is the sample variance of \(Y_1,\ldots,Y_n\)
\(\nu_0\sigma_0^2\) and \(\nu_n\sigma_n^2\) can be interpreted as prior and posterior “sum of squared observations from the sample mean”
mu_0 <- 2
sigma_0 <- 0.6
kappa_0 <- 12
nu_0 <- 12
n <- length(revenue)
Y_bar <- mean(revenue)
s2 <- var(revenue)
kappa_n <- kappa_0 + n
mu_n <- (kappa_0 * mu_0 + n * Y_bar)/kappa_n
nu_n <- nu_0 + n
sigma_n <- sqrt((nu_0 * sigma_0^2 + (n - 1) * s2 + kappa_0 * n/kappa_n * (Y_bar -
mu_0)^2)/nu_n)
The posterior parameters are \(\mu_n = 1.9119\), \(\kappa_n = 21\), \(\nu_n = 21\), and \(\sigma_n^2 = 0.2477\).
The posterior joint distribution of \((\mu,\sigma^2)\) is determined by \[ \begin{align*} \{\mu|Y_1,\ldots,Y_n, \sigma^2\} &\sim \mathrm{normal}(1.9119, \sigma^2/21),\\ \{\sigma^{-2}|Y_1,\ldots,Y_n\} & \sim \mathrm{gamma}(10.5,2.6012), \end{align*} \]
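Given these closed forms, we can sample from the joint posterior by first drawing the precision, then drawing \(\mu\) given \(\sigma^2\) (a sketch using the posterior parameters above):

```r
## Monte Carlo draws from the joint posterior of (mu, sigma^2),
## using the posterior parameters computed above
set.seed(1)
N <- 1e4
prec <- rgamma(N, 10.5, 2.6012)          ## sigma^{-2} | Y
sigma2 <- 1/prec
mu <- rnorm(N, 1.9119, sqrt(sigma2/21))  ## mu | sigma^2, Y
quantile(mu, c(0.025, 0.975))            ## 95% credible interval for mu
```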
Monte Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results
One runs simulations many times to learn about an unknown probabilistic quantity
The modern version of MC was invented by Stanislaw Ulam in the late 1940s, while he was working on nuclear weapons projects at the Los Alamos National Laboratory
Let \(X\) be a random variable with density \(f(x)\) and we are interested in computing \[\mathbb{E}[h(X)] = \int h(x)f(x)\, dx\]
MC generates a sample \(X_1,\ldots,X_n\) from the density \(f\) and approximates the integral with \[\bar{h}_n = \frac{1}{n}\sum_{j=1}^n h(X_j)\]
Convergence (Strong Law of Large Numbers) \[\lim_{n\to\infty}\frac{1}{n}\sum_{j=1}^n h(X_j) = \mathbb{E}[h(X)]\]
N <- 10^4
X <- runif(N)
mean(X^2 + 4 * X * sin(X))
## [1] 1.537
Error is of order \(n^{-1/2}\)
Central Limit Theorem \[\sqrt{n}\,\frac{\bar{h}_n - \mathbb{E}[h(X)]}{\sqrt{\mathrm{Var}(h(X))}}\Rightarrow \mathcal N(0, 1)\]
However, \(\mathrm{Var}(h(X))\) is usually unknown and instead we use the sample variance \(s_n^2 = \frac{1}{n}\sum_{j=1}^n[h(X_j)-\bar{h}_n]^2\) to approximate it
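Continuing the uniform example above, the sample variance yields an approximate standard error and confidence interval for the Monte Carlo estimate (the true value is \(\int_0^1 (x^2+4x\sin x)\,dx \approx 1.538\)):

```r
## Monte Carlo estimate of E[X^2 + 4X sin X], X ~ Uniform(0,1),
## together with an approximate 95% confidence interval
set.seed(1)
n <- 1e4
X <- runif(n)
h <- X^2 + 4 * X * sin(X)
h_bar <- mean(h)
se <- sd(h)/sqrt(n)                      ## approximate standard error
c(h_bar - 1.96 * se, h_bar + 1.96 * se)  ## ~95% confidence interval
```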
\[ 0.2\mathcal N(-5, 4) + 0.5\mathcal N(0, 1) + 0.3\mathcal N(3, 0.25) \]
Statistical software packages implement only “standard” probability distributions
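For instance, the three-component mixture above is not a "standard" distribution, but it can still be simulated by composition: first draw a component label, then draw from the selected normal component (a sketch):

```r
## Sampling the mixture 0.2 N(-5,4) + 0.5 N(0,1) + 0.3 N(3,0.25)
## by first drawing a component label
set.seed(1)
N <- 1e4
comp <- sample(1:3, N, replace = TRUE, prob = c(0.2, 0.5, 0.3))
means <- c(-5, 0, 3)
sds <- sqrt(c(4, 1, 0.25))               ## note: rnorm takes sd, not variance
X <- rnorm(N, mean = means[comp], sd = sds[comp])
mean(X)  ## true mean = 0.2*(-5) + 0.5*0 + 0.3*3 = -0.1
```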
Monte Carlo simulation is fundamentally based on the production of uniform random variables between 0 and 1
A linear congruential generator produces integers via \(X_{i+1} = (aX_i + c) \bmod m\); then transform them to numbers between 0 and 1 by \[R_i = \frac{X_i}{m}\]
Shortcoming: period is not long enough for modern usage
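A minimal sketch of such a generator; the Park-Miller constants \(a = 16807\), \(c = 0\), \(m = 2^{31}-1\) are an illustrative choice, not values fixed by the slides:

```r
## A minimal linear congruential generator X_{i+1} = (a * X_i) mod m,
## with Park-Miller constants (illustrative choice)
lcg <- function(n, seed = 1, a = 16807, m = 2^31 - 1) {
  X <- numeric(n)
  X[1] <- seed
  for (i in 2:n) X[i] <- (a * X[i - 1]) %% m
  X / m  ## transform to numbers in (0, 1)
}
R <- lcg(1000)
c(mean(R), var(R))  ## close to 1/2 and 1/12 for uniforms
```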
runif generates pseudo-random numbers and set.seed sets the random seed
runif(5)
## [1] 0.1454 0.8607 0.7118 0.5971 0.2862
runif(5)
## [1] 0.6748 0.5171 0.6560 0.5406 0.2624
set.seed(123)
runif(5)
## [1] 0.2876 0.7883 0.4090 0.8830 0.9405
set.seed(123)
runif(5)
## [1] 0.2876 0.7883 0.4090 0.8830 0.9405
N <- 10^4; U <- runif(N);
X <- -log(1 - U) / 3;
mean(X) ## true mean is 1/3
## [1] 0.3286
sd(X) ## true sd is 1/3
## [1] 0.3286
N <- 10^6; U <- runif(N);
system.time(qnorm(U, mean=0, sd=1))
## user system elapsed
## 0.041 0.003 0.043
system.time(rnorm(N, mean=0, sd=1))
## user system elapsed
## 0.091 0.000 0.092
Given a target density \(f\), find a density \(g\) and a constant \(M>0\) such that \[f(x) \leq M g(x)\] for all \(x\) on the support of \(f\)
\[ \begin{align*} P(X\leq x | \mathrm{accept}) = & \frac{P(X\leq x, U\leq \frac{f(X)}{Mg(X)})}{P(U\leq \frac{f(X)}{Mg(X)})} \\ = & \frac{\int g(t) P(X \leq x, U\leq \frac{f(X)}{Mg(X)}|X=t)\, dt}{\int g(t) P(U\leq \frac{f(X)}{Mg(X)}|X=t)\, dt} \\ = & \frac{\int_{-\infty}^x g(t) \frac{f(t)}{Mg(t)}\,dt}{\int_{-\infty}^\infty g(t) \frac{f(t)}{Mg(t)}\,dt} = \int_{-\infty}^x f(t)\,dt \end{align*} \]
\[\frac{f(x)}{g(x)}=\sqrt{\frac{\pi}{2}}(1+x^2)e^{-x^2/2} \] is maximized at \(x=\pm 1\) and the maximum is \(M=\sqrt{\frac{2\pi}{e}}\)
N <- 10^4
M <- sqrt(2 * pi/exp(1))
samples <- rcauchy(N)
U <- runif(N)
A_or_R <- U <= dnorm(samples)/(M * dcauchy(samples))
accepted_samples <- samples[A_or_R]
mean(accepted_samples)
## [1] 0.006658
sd(accepted_samples)
## [1] 1.014
length(accepted_samples)/N ## true P(accept) = 0.6577
## [1] 0.6556
The Cauchy tail \(x^{-2}\) is much heavier than the normal tail \(e^{-x^2/2}\)
\[\frac{f(x)}{g(x)} = \sqrt{\frac{2}{\pi}}e^{|x|-x^2/2}\] is maximized at \(x=\pm 1\) and the maximum is \(M=\sqrt{\frac{2 e}{\pi}}\)
N <- 10^4
M <- sqrt(2 * exp(1)/pi)
samples <- rexp(N) * sample(c(1, -1), N, replace = TRUE)
U <- runif(N)
A_or_R <- U <= dnorm(samples)/(M * 0.5 * dexp(abs(samples)))
accepted_samples <- samples[A_or_R]
mean(accepted_samples)
## [1] -0.008401
sd(accepted_samples)
## [1] 0.9811
length(accepted_samples)/N ## true P(accept) = 0.7602
## [1] 0.7648
\[\frac{f(x)}{g(x)} = \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)} \] is maximized at \(x=\frac{a-1}{a+b-2}\) and its maximum is
\[M = \frac{(a-1)^{a-1}(b-1)^{b-1}}{(a+b-2)^{a+b-2}B(a,b)}\]
N <- 10^4
a <- 2.5
b <- 4.2
M <- (a - 1)^(a - 1) * (b - 1)^(b - 1)/(a + b - 2)^(a + b - 2)/beta(a, b)
samples <- runif(N)
U <- runif(N)
A_or_R <- U <= dbeta(samples, a, b)/M
accepted_samples <- samples[A_or_R]
mean(accepted_samples) ## true mean is a/(a+b) = 0.3731
## [1] 0.3745
var(accepted_samples) ## true variance is ab/(a+b)^2/(a+b+1) = 0.0304
## [1] 0.03065
length(accepted_samples)/N ## true P(accept) = 0.4733
## [1] 0.4675
For \(t=1,\ldots,T\), when at \((x^{(t)}, u^{(t)})\) simulate \(u^{(t+1)}\sim \mathrm{Uniform}(0, f(x^{(t)}))\), then \(x^{(t+1)}\sim \mathrm{Uniform}\) on \(\{x: f(x)\geq u^{(t+1)}\}\)
Then, \(\{x^{(t)}:t=1,\ldots,T\}\) are samples generated from \(f(x)\)
T <- 10^4
x <- rep(1, T)
y <- rep(0.1, T)
for (t in 1:(T - 1)) {
y[t + 1] <- runif(1, 0, exp(-sqrt(x[t]))/2)
x[t + 1] <- runif(1, 0, (log(2 * y[t + 1]))^2)
}
Target density: \(f(x)\propto f_1(x)=e^{-(x+3)^2/2}\mathbb{I}_{[0, 1]}(x)\)
Horizontal: \(X|u \sim \mathrm{Uniform}(0, \min(1,\sqrt{-2\log(u)}-3))\)
T <- 10^4
x <- rep(0.25, T)
y <- rep(0.0025, T)
for (t in 1:(T - 1)) {
y[t + 1] <- runif(1, 0, exp(-(x[t] + 3)^2/2))
x[t + 1] <- runif(1, 0, min(1, sqrt(-2 * log(y[t + 1])) - 3))
}
\(\{x^{(t)}:t=1,\ldots,T\}\) form a Markov chain and its stationary distribution has the target density \(f(x)\)
Advantage: it does not need an instrumental distribution or the normalizing constant of the target distribution
Disadvantage: it may be difficult to compute the region \(\{x: f(x)\geq u\}\), especially for multi-dimensional cases
Metropolis algorithm was proposed in 1953
Hastings generalized the algorithm in 1970 and introduced it to the statistics community
For \(t=1,\ldots, T\), when at \(x^{(t)}\)
Simulate \(y_t\sim q(y|x^{(t)})\)
Simulate \(U\sim \mathrm{Uniform}(0,1)\) and take \[x^{(t+1)} = \left\{ \begin{array}{ll} y_t, & \mbox{if } U \leq \frac{f(y_t)}{f(x^{(t)})}\frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\\ x^{(t)}, & \mbox{otherwise.} \end{array} \right.\]
\(x^{(1)}, x^{(2)}, \ldots\) are correlated
Never move to values with \(f(y)=0\)
Always accepts \(y_t\) such that \[\frac{f(y_t)}{q(y_t|x^{(t)})} > \frac{f(x^{(t)})}{q(x^{(t)}|y_t)}\]
If \(y_t\) decreases the ratio, the move is accepted only with probability equal to that ratio, so it is sometimes rejected
For \(t=1,\ldots, T\), when at \(x^{(t)}\)
Simulate \(y_t\sim g(y)\)
Simulate \(U\sim \mathrm{Uniform}(0,1)\) and take \[x^{(t+1)} = \left\{ \begin{array}{ll} y_t, & \mbox{if } U\leq \frac{f(y_t)}{f(x^{(t)})}\frac{g(x^{(t)})}{g(y_t)} \\ x^{(t)}, & \mbox{otherwise.} \end{array} \right.\]
Can be viewed as a generalization of the Acceptance-Rejection method
Instrumental: \(g(x) = \frac{1}{2}e^{-|x|}\)
The acceptance probability in independent MH is \[\min\{1, \exp[(|x^{(t)}|-|y_t|)((|x^{(t)}|+|y_t|) / 2-1)]\}\]
N <- 10^4; x <- rep(0, N); accept_count <- 0;
y <- rexp(N) * sample(c(1, -1), size=N, replace=TRUE) ## simulate double exp.
U <- runif(N)
for (t in 1:(N-1)){
if (U[t] <= exp((abs(x[t]) - abs(y[t])) * ((abs(x[t]) + abs(y[t]))/2 - 1))){
accept_count <- accept_count + 1
x[t+1] <- y[t]
}
else
x[t+1] <- x[t]
}
accept_count / N ## acceptance rate for A-R is 0.7602
## [1] 0.8402
This is the original algorithm in Metropolis et al. (1953), thus called the Metropolis algorithm
For \(t=1,\ldots, T\), when at \(x^{(t)}\)
Simulate \(y_t\sim g(y - x^{(t)})\)
Simulate \(U\sim \mathrm{Uniform}(0,1)\) and take \[x^{(t+1)} = \left\{ \begin{array}{ll} y_t, & \mbox{if } U\leq \frac{f(y_t)}{f(x^{(t)})} \\ x^{(t)}, & \mbox{otherwise.} \end{array} \right.\]
Otherwise, we sometimes reject the move; the larger the relative drop in density, the more likely the move is to be rejected
So in the long run, the chain tends to stay in high-density regions of \(f(x)\), while occasionally visiting low-density regions
N <- 10^4; x <- rep(0, N); delta <- 0.1; accept_count <- 0;
epsilon <- runif(N, -delta, delta) ## simulate uniform on [-delta, delta]
U <- runif(N)
for (t in 1:(N-1)){
y_t <- x[t] + epsilon[t]
if (U[t] <= exp((x[t]^2 - y_t^2)/2)){
accept_count <- accept_count + 1
x[t+1] <- y_t
}
else
x[t+1] <- x[t]
}
accept_count / N; ## maybe too high...
## [1] 0.981
How do we choose the instrumental distribution \(q\) when developing an MH algorithm?
In contrast to the A-R method, maximizing the acceptance rate is not necessarily the best, especially for random walk MH
For independent MH, we can optimize the algorithm by maximizing acceptance rate
In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%.
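To illustrate the trade-off, the earlier random-walk example with standard normal target can be rerun with several step sizes; the acceptance rate alone does not tell us which chain mixes best (a sketch):

```r
## Effect of the step size delta on the random-walk MH acceptance rate
## (target: standard normal, uniform(-delta, delta) increments)
set.seed(1)
rw_accept_rate <- function(delta, N = 1e4) {
  x <- 0; accepts <- 0
  for (t in 1:N) {
    y <- x + runif(1, -delta, delta)
    if (runif(1) <= exp((x^2 - y^2)/2)) { x <- y; accepts <- accepts + 1 }
  }
  accepts/N
}
sapply(c(0.1, 1, 5, 20), rw_accept_rate)  ## rate falls as delta grows
```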
Proposed by S. Geman and D. Geman in 1984
Named after the physicist J. W. Gibbs, in reference to an analogy between the algorithm and statistical physics
Target \(f\) is the joint distribution of \(\mathbf{X}=(X_1,\ldots,X_n)\)
Suppose we can simulate from the full conditionals \[X_i | \mathbf{x}_{-i} \sim f_i(x_i| \mathbf{x}_{-i})\] for \(i=1,\ldots,n\), where \(\mathbf{x}_{-i} = (x_1,\ldots,x_{i-1}, x_{i+1},\ldots,x_n)\)
Then we can construct a Markov chain \(\{\mathbf{X}^{(t)} \}\) whose stationary distribution is \(f\)
For \(t=1,\ldots, T\), given \(\mathbf{x}^{(t)} = (x^{(t)}_1,\ldots,x^{(t)}_n)\), simulate
\(X^{(t+1)}_1 \sim f_1(x_1| x_2^{(t)}, x_3^{(t)},\ldots, x_n^{(t)})\)
\(X^{(t+1)}_2 \sim f_2(x_2| x_1^{(t+1)}, x_3^{(t)},\ldots, x_n^{(t)})\) \[\vdots\]
\(X^{(t+1)}_n \sim f_n(x_n| x_1^{(t+1)}, x_2^{(t+1)},\ldots, x_{n-1}^{(t+1)})\)
Target is \(f(x,y) \propto e^{-(x^2y^2+x^2+y^2-8x-8y)/2}\)
The full conditionals are \[f(x|y) \propto e^{-[(y^2+1)x^2-8x]/2} \] and \[f(y|x) \propto e^{-[(x^2+1)y^2-8y]/2}\]
Noting the quadratic form in the exponent, the full conditionals are simply normal distributions \[x|y \sim \mathcal N\left(4/(1+y^2), 1/(1+y^2)\right)\] and \[y|x \sim \mathcal N\left(4/(1+x^2), 1/(1+x^2)\right)\]
N <- 10^4; x <- rep(0, N); y <- rep(0, N);
for (t in 1:(N-1)){
x[t+1] <- rnorm(1, mean=4/(1+y[t]^2), sd=1/sqrt(1+y[t]^2))
y[t+1] <- rnorm(1, mean=4/(1+x[t+1]^2), sd=1/sqrt(1+x[t+1]^2))
}
Slice sampler does not need an instrumental distribution but requires one to compute the “inverse” of the density function
Gibbs sampler converts a multi-dimensional distribution to a sequence of one-dimensional conditional distributions (i.e. full conditionals)
Companies may default especially in a bad economy
\(N(t)\) is the accumulated number of defaults since time 0
\(x(t)\) is the “state” of the economy, which can not be observed directly
Assume \(\lambda(t)=\mu e^{x(t)}\) and \[d x(t) = - \kappa x(t)\, dt + \sigma\, d W(t),\] where \(W(t)\) is a standard Brownian motion
We are concerned with the estimation of the parameters \(\Theta=(\mu, \kappa, \sigma)\)
T <- 100;
kappa <- 0.3; sigma <- 0.1; mu <- 5;
x_0 <- 0;
x <- rep(x_0, T+1);
for (t in 1:T)
x[t+1] <- rnorm(1, mean=(1-kappa)*x[t], sd=sigma)
y <- rpois(T, mu*exp(x[2:(T+1)]))
A challenge here, and also for other state space models, is that the “states” are unobservable (i.e. latent variables)
Given the data \(\mathbf Y=\{Y_1,\ldots, Y_T\}\), Bayesian inference is interested in the joint posterior distribution of the parameters \(\Theta\) and the state variables \(\mathbf{x}=(x_1,\ldots,x_T)\) \[p(\Theta, \mathbf x|\mathbf Y)\]
Assume the priors of the parameters are independent
Assume \(p(\mu)\) is Gamma with shape \(a_1\) and rate \(b_1\), \[ \begin{align*} p(\mu|\kappa, \sigma^2, \mathbf x,\mathbf Y) \propto & p(\mu|\kappa, \sigma^2) p(\mathbf Y, \mathbf x|\mu, \kappa, \sigma^2) \\ \propto & p(\mu) \prod_{t=1}^T p(Y_t|x_t, \mu) \\ \propto & \mu^{a_1-1}e^{-b_1\mu} \prod_{t=1}^T (\mu e^{x_t})^{Y_t}e^{-\mu e^{x_t}} \\ \propto & \mu^{a_1-1+\sum_t Y_t} \cdot e^{-\mu (b_1+\sum_t e^{x_t})} \end{align*} \] so \(p(\mu|\kappa, \sigma^2, \mathbf x,\mathbf Y)\) is Gamma with shape \(a_1+\sum_t Y_t\) and rate \(b_1+\sum_t e^{x_t}\)
Assume \(p(\kappa)\) is normal with mean \(a_2\) and variance \(b_2^2\). Then \(p(\kappa|\mu, \sigma^2, \mathbf x,\mathbf Y)\) is normal with mean \(B/A\) and variance \(1/A\), where \[A = 1/b_2^2 + \textstyle\sum_t x_{t-1}^2/\sigma^2\] and \[B = a_2/b_2^2 - \textstyle\sum_t x_{t-1}(x_t-x_{t-1})/\sigma^2\]
Assume \(p(\sigma^2)\) is inverse Gamma (i.e. \(p(1/\sigma^2)\) is Gamma) with shape \(a_3\) and rate \(b_3\). Then \(p(\sigma^2|\mu,\kappa, \mathbf x,\mathbf Y)\) is inverse Gamma with shape \(a_3+T/2\) and scale \[b_3+\textstyle\sum_t[x_t-(1-\kappa)x_{t-1}]^2/2\]
\(p(x_t|\mathbf x_{-t}, \mathbf Y, \Theta) \propto p(x_{t+1}|x_t, \Theta)p(x_t|x_{t-1}, \Theta)p(Y_t|x_t, \Theta)\), where \[ \begin{align*} p(x_{t+1}|x_t, \Theta)\propto & \exp\left(\textstyle-\frac{[x_{t+1}-(1-\kappa)x_t]^2}{2\sigma^2}\right) \\ p(x_{t}|x_{t-1}, \Theta)\propto & \exp\left(\textstyle-\frac{[x_{t}-(1-\kappa)x_{t-1}]^2}{2\sigma^2}\right) \\ p(Y_t|x_t, \Theta) \propto & \exp\left(\textstyle-\mu e^{x_t}+x_tY_t\right) \end{align*}\]
Note that the tail of \(p(x_t|\mathbf x_{-t}, \mathbf Y, \Theta)\) is of the order \(\exp(-\mu e^{x_t})\), which is very light
For \(k=1,\ldots, n\), given \((\mu^{(k)}, \kappa^{(k)}, \sigma^{(k)}, \mathbf x^{(k)})\),
Simulate \(\mu^{(k+1)}\) from Gamma posterior \(p(\mu|\kappa^{(k)}, \sigma^{(k)}, \mathbf x^{(k)})\)
Simulate \(\kappa^{(k+1)}\) from normal posterior \(p(\kappa|\mu^{(k+1)}, \sigma^{(k)}, \mathbf x^{(k)})\)
Simulate \({\sigma^{(k+1)}}^2\) from inverse Gamma posterior \(p(\sigma^2|\mu^{(k+1)}, \kappa^{(k+1)}, \mathbf x^{(k)})\)
Use independent MH to iteratively simulate \(x^{(k+1)}_t\) from \(p(x_t|\mu^{(k+1)}, \kappa^{(k+1)}, \sigma^{(k+1)}, x_1^{(k+1)}, \ldots, x_{t-1}^{(k+1)}, x_{t+1}^{(k)},\ldots,x_T^{(k)} )\), for \(t=1,\ldots, T\), with a normal instrumental density with mean 0
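One sweep of this sampler can be sketched as follows; the hyperparameters (a1, b1, a2, b2, a3, b3) and the instrumental standard deviation tau are illustrative assumptions, not values from the slides:

```r
## A condensed sketch of one Gibbs sweep for the default-intensity model;
## hyperparameters and tau below are illustrative choices
gibbs_sweep <- function(state, Y, a1 = 1, b1 = 1, a2 = 0, b2 = 1,
                        a3 = 2, b3 = 0.1, tau = 0.5) {
  x <- state$x; kappa <- state$kappa; sigma2 <- state$sigma2
  T <- length(Y)
  x_lag <- c(0, x[-T])  ## x_{t-1}, taking x_0 = 0
  ## 1. mu | rest: Gamma(a1 + sum(Y), b1 + sum(exp(x)))
  mu <- rgamma(1, a1 + sum(Y), b1 + sum(exp(x)))
  ## 2. kappa | rest: normal with mean B/A, variance 1/A
  A <- 1/b2^2 + sum(x_lag^2)/sigma2
  B <- a2/b2^2 - sum(x_lag * (x - x_lag))/sigma2
  kappa <- rnorm(1, B/A, 1/sqrt(A))
  ## 3. sigma^2 | rest: inverse Gamma, drawn via the precision
  sigma2 <- 1/rgamma(1, a3 + T/2,
                     b3 + sum((x - (1 - kappa) * x_lag)^2)/2)
  ## 4. x_t | rest: one independent-MH step each, N(0, tau^2) instrumental
  for (t in 1:T) {
    x_prev <- if (t == 1) 0 else x[t - 1]
    log_f <- function(z) {
      lp <- -(z - (1 - kappa) * x_prev)^2/(2 * sigma2) - mu * exp(z) + z * Y[t]
      if (t < T) lp <- lp - (x[t + 1] - (1 - kappa) * z)^2/(2 * sigma2)
      lp
    }
    y <- rnorm(1, 0, tau)
    log_ratio <- log_f(y) - log_f(x[t]) +
      dnorm(x[t], 0, tau, log = TRUE) - dnorm(y, 0, tau, log = TRUE)
    if (log(runif(1)) <= log_ratio) x[t] <- y
  }
  list(mu = mu, kappa = kappa, sigma2 = sigma2, x = x)
}

## one sweep on data simulated as earlier
set.seed(123)
T <- 100; kappa <- 0.3; sigma <- 0.1; mu <- 5
x_true <- rep(0, T + 1)
for (t in 1:T) x_true[t + 1] <- rnorm(1, (1 - kappa) * x_true[t], sigma)
y <- rpois(T, mu * exp(x_true[2:(T + 1)]))
state <- list(x = rep(0, T), kappa = 0.3, sigma2 = 0.01)
state <- gibbs_sweep(state, y)
c(state$mu, state$kappa, state$sigma2)
```

Iterating `gibbs_sweep` many times and averaging the draws after a burn-in period would produce posterior summaries like the ones reported below.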
## mu kappa sigma
## 5.0689 0.3321 0.1101
Bayesian inference + MCMC is powerful!
However, fine-tuning instrumental distributions and hyperparameters requires great effort!
Do not blindly trust the results. If something is counter-intuitive, make sure to find out why!